Highly diverse and unknown viruses may enhance Antarctic endoliths’ adaptability

Background Rock-dwelling microorganisms are key players in the ecosystem functioning of Antarctic ice-free areas. Yet little is known about their diversity and ecology, and the viruses in these communities remain largely unexplored despite their important roles in host metabolism and nutrient cycling. To begin to address this, we present a large-scale viral catalog from Antarctic rock microbial communities. Results We performed metagenomic analyses on rocks from across Antarctica representing a broad range of environmental and spatial conditions, which resulted in a predicted viral catalog comprising > 75,000 viral operational taxonomic units (vOTUs). We found largely undescribed, highly diverse, and spatially structured virus communities with predicted auxiliary metabolic genes (AMGs) whose functions indicate that they may influence bacterial adaptation and biogeochemistry. Conclusion This catalog lays the foundation for expanding knowledge of virosphere diversity, function, spatial ecology, and dynamics in extreme environments. This work serves as a step towards exploring adaptability of microbial communities in the face of a changing climate. Supplementary Information The online version contains supplementary material available at 10.1186/s40168-023-01554-6.


Background
Viruses are among the most prevalent entities on our planet, with the ability to infect organisms across all domains [1]. Sequencing advances are reshaping understanding of viral diversity across Earth's diverse ecosystems, leading to a remarkable expansion of viral catalogs [1][2][3][4][5][6]. It is becoming clear that viruses play key roles in global biogeochemical cycles through the modulation of host population dynamics, and that the better-studied pathogenic viruses represent only a small fraction of the virosphere [7][8][9]. Further, through auxiliary metabolic genes (AMGs), some viruses can directly impact host metabolism to improve fitness [10], including in terrestrial ecosystems characterized by extreme conditions (e.g., oligotrophy, aridity, high or low temperature).
Antarctic ice-free areas include several of the most inhospitable regions on Earth, among which is the Mars counterpart [11]: the McMurdo Dry Valleys. In these locations, where rocks represent the main substratum, active life is possible for only a few specialized microorganisms; they survive by dwelling in porous rocks, forming self-sustaining ecosystems called endolithic communities [12,13]. These microorganisms are the primary life-forms present, ensuring the balance and functionality of these otherwise inert ecosystems. Recent studies have shed light on their biodiversity and adaptation, particularly the evolution of new and peculiar taxa spanning bacteria, fungi, and archaea [13][14][15][16]. However, the ecology and distribution of viral diversity in these communities remain wholly unknown and, to date, viral studies have instead focused on Antarctic freshwater lakes [17][18][19], surrounding oceans [20][21][22], and soils [23][24][25][26].
Here, we provide a large-scale viral catalog from 191 Antarctic endolith metagenomes. We sampled 37 localities across a broad range of environmental (e.g., 4 rock typologies, different altitudes and sun exposure) and spatial conditions (i.e., Antarctic Peninsula, Northern Victoria Land, and McMurdo Dry Valleys) (Table S1; Fig. 2A). We aimed to (i) untangle viral diversity in these communities, (ii) predict AMGs and how they may drive the fitness of their hosts, and (iii) explore ecological patterns (e.g., biogeography). This catalog is the first step toward understanding the role of viruses in the coldest and driest region on Earth. This information is also critical for elucidating the possible role of viruses in whole community adaptation in a dry ecosystem that will expand owing to global change [27].

Study area
One hundred ninety-one rocks colonized by endolithic communities were collected at thirty-eight sites in Antarctica, including the Antarctic Peninsula (n = 3), McMurdo Dry Valleys, Southern Victoria Land (n = 80), and Northern Victoria Land (n = 108), during more than 20 years of Italian Antarctic Expeditions. Different rock typologies (sandstone n = 141, granite n = 43, quartz n = 5, and basalt/dolerite n = 2) were sampled. Samples were collected along a latitudinal transect ranging from −62.10008, −58.51664 to −77.874, 160.739 (latitude, longitude) under different environmental conditions, namely sun exposure (northern sun-exposed and southern shady rocks), and along an altitudinal transect from sea level to 3100 m above sea level (a.s.l.), to provide a comprehensive overview of Antarctic endolithic diversity (Table S1). The presence of endolithic colonization was assessed by direct observation in situ. Rocks were excised using a geologic hammer and sterile chisel, and rock samples were preserved in sterile plastic bags, transported, and stored at −20 °C in the Culture Collection of Antarctic fungi of the Mycological Section of the Italian Antarctic National Museum (MNA-CCFEE) until downstream analysis.

Study data
In total, the dataset included 191 metagenomes, of which 100 have been assembled as described in Albanese et al. [14]. The remaining metagenomes were generated, sequenced, and assembled as described below. The final metagenomic set represented 149,585,625 metagenomic contigs.

DNA extraction, library preparation, and sequencing
Total community DNA was extracted from 1 g of crushed rocks using the DNeasy PowerSoil Pro Kit (Qiagen, Germany), quality checked by electrophoresis on a 1.5% agarose gel and with a Nanodrop spectrophotometer (Thermofisher, USA), and quantified using the Qubit dsDNA HS Assay Kit (Life Technologies, USA) according to Coleine et al. [13]. Paired-end shotgun metagenomic libraries were constructed using Next Ultra DNA library prep kits and sequenced as 2 × 150 bp on the Illumina NovaSeq platform (Illumina Inc., San Diego, CA, USA) at the Edmund Mach Foundation (San Michele all'Adige, Italy) and at the DOE Joint Genome Institute (JGI).
Viral sequences from all assemblies were combined and clustered into 95% similarity viral operational taxonomic units (vOTUs) using CD-HIT v. 4.8.1 [37] with the following parameters: -c 0.95 -aS 0.85 -M 0 -d 0. Prodigal v. 2.6.3 [38] was used to predict open reading frames in vOTUs using the -p meta option. VContact2 v. 0.9.19 was then run on predicted proteins from phage vOTUs and predicted proteins from the INPHARED August 2022 viral reference database to generate viral clusters (VCs) based on gene-sharing networks [39,40]. We assigned taxonomy to phage vOTUs based on VC membership as in Santos-Medellin et al. [41]. Predicted viral sequences and 95% similarity vOTUs are archived on Zenodo [42].
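The clustering and gene-prediction settings reported above can be written down as command lines. The following is a minimal sketch, not the authors' pipeline: it assumes the nucleotide program `cd-hit-est` from the CD-HIT suite (the text names only "CD-HIT"), and all file names are hypothetical placeholders. Only the flags quoted in the text (-c, -aS, -M, -d, -p) come from the source; -i, -o, and -a are the standard input and output options of the two tools.

```python
from typing import List


def cd_hit_command(fasta_in: str, fasta_out: str) -> List[str]:
    """Build a CD-HIT call with the parameters quoted in the text.

    Assumption: cd-hit-est (the nucleotide program of the CD-HIT suite)
    is the binary used; the source only says "CD-HIT v. 4.8.1".
    """
    return [
        "cd-hit-est",
        "-i", fasta_in,   # combined predicted viral sequences (hypothetical file)
        "-o", fasta_out,  # representative sequences, one per vOTU
        "-c", "0.95",     # sequence identity threshold
        "-aS", "0.85",    # alignment must cover 85% of the shorter sequence
        "-M", "0",        # no memory limit
        "-d", "0",        # keep sequence names up to the first whitespace
    ]


def prodigal_command(fasta_in: str, proteins_out: str) -> List[str]:
    """Build a Prodigal call in metagenomic mode (-p meta)."""
    return [
        "prodigal",
        "-i", fasta_in,       # vOTU nucleotide sequences
        "-a", proteins_out,   # predicted protein translations
        "-p", "meta",         # metagenomic mode, as stated in the text
    ]


# Printing the commands rather than running them keeps the sketch
# usable on machines without either tool installed.
print(" ".join(cd_hit_command("viral_sequences.fna", "votus.fna")))
print(" ".join(prodigal_command("votus.fna", "votu_proteins.faa")))
```

To execute the commands, each list could be passed to `subprocess.run(cmd, check=True)`.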

Viral host-prediction
Hosts were predicted for the phage sequences identified using (i) a database of complete genomes from NCBI RefSeq, and (ii) a previously published database of representative metagenome-assembled genomes (MAGs) from Antarctic endolith samples. To produce (i), we used "ncbi-genome-download" to download all complete bacterial (n = 25,984) and archaeal (n = 416) genomes, as of April 7, 2022, from NCBI RefSeq [43]. For (ii), we downloaded MAGs from Zenodo (https://doi.org/10.5281/zenodo.7313591). We then used NCBI BLAST 2.12.0+ to convert these two databases into BLAST databases using "makeblastdb" and used "blastn" to compare vOTUs to these databases [44]. We filtered the blastn results in R based on existing thresholds [45][46][47]. Briefly, database matches had to share a ≥ 2000 bp region with ≥ 70% sequence identity to the viral sequence and needed to have a bit score of ≥ 50 and a maximum e value of 0.001.
Further, to ensure that matches did not represent partially or entirely viral contigs when searching against the MAG database, matches had to cover < 50% of the total MAG sequence length. As in Korthari et al. [46], only the top 5 hits matching these thresholds were considered, with host predictions made at each taxonomic level only if the taxonomy of all hits was in agreement. Discrepancies resulted in no host prediction for that taxonomic level. We then combined host predictions from the RefSeq and MAG databases; if there were discrepancies between the two databases, we defaulted to the MAG-based prediction.
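The filtering and consensus rules described above can be illustrated with a short sketch. This is not the authors' R code: the `Hit` record, its field names, and the choice to rank hits by bit score are assumptions made for illustration, and the < 50% MAG coverage rule is omitted for brevity. The per-rank merge in `combine_predictions` is one possible reading of "defaulted to the MAG-based prediction".

```python
from collections import namedtuple
from typing import Dict, Iterable, Optional, Sequence

# Hypothetical record for one blastn match; `taxonomy` lists names in the
# same order as the `ranks` argument used below (e.g., phylum, class).
Hit = namedtuple("Hit", "votu subject aln_len pident bitscore evalue taxonomy")


def passes_thresholds(hit: Hit) -> bool:
    """Thresholds from the text: >= 2000 bp, >= 70% identity,
    bit score >= 50, e value <= 0.001."""
    return (hit.aln_len >= 2000 and hit.pident >= 70.0
            and hit.bitscore >= 50 and hit.evalue <= 0.001)


def predict_host(hits: Iterable[Hit], ranks: Sequence[str],
                 top_n: int = 5) -> Dict[str, Optional[str]]:
    """Consensus host prediction for one vOTU.

    Assumption: "top 5 hits" means the five highest bit scores.
    A rank gets a prediction only if all retained hits agree on it.
    """
    kept = sorted((h for h in hits if passes_thresholds(h)),
                  key=lambda h: h.bitscore, reverse=True)[:top_n]
    prediction: Dict[str, Optional[str]] = {}
    for i, rank in enumerate(ranks):
        names = {h.taxonomy[i] for h in kept}
        prediction[rank] = names.pop() if len(names) == 1 else None
    return prediction


def combine_predictions(refseq: Dict[str, Optional[str]],
                        mag: Dict[str, Optional[str]]) -> Dict[str, Optional[str]]:
    """Prefer the MAG-based call; fall back to RefSeq where MAGs give none."""
    return {rank: mag[rank] if mag[rank] is not None else refseq[rank]
            for rank in refseq}
```

For example, two retained hits that agree on phylum but not on class yield a phylum-level prediction only, and a short high-scoring match below 2000 bp is ignored.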

Ecological analysis of vOTUs
We mapped reads from each metagenome to vOTUs using BBMap with a minimum sequence similarity of 90% to quantify vOTU relative abundance [48]. We then used SAMtools to convert resulting sam files to bam files and genomecov from BEDTools to obtain coverage information for each vOTU across each metagenome [49,50]. We then used bamM to parse bam files and calculate the trimmed pileup coverage (tpmean), which we used here in our analysis of viral relative abundance [51]. We removed vOTUs which displayed < 75% coverage over the length of the viral sequence and viral sequences < 10 kbp in length prior to downstream analyses in R [52]. Thresholds for analysis of vOTUs were based on community guidelines for length (i.e., ≥ 10 kbp), similarity (i.e., ≥ 95% similarity), and detection (i.e., ≥ 75% of the viral genome length covered ≥ 1 × by reads at ≥ 90% average nucleotide identity) [53,54]. To be conservative, we also removed vOTUs with a CheckV quality score of "not-determined" prior to downstream analysis. The viral abundance (tpmean), quality, taxonomy, and annotation results were imported, analyzed, and visualized in R using many packages including tidyverse and phyloseq [55,56]. Analysis scripts associated with this study are on GitHub and archived in Zenodo [57].
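As an illustration of the thresholds listed above, the sketch below filters a small abundance table. It is a simplified reading rather than the study's R workflow: the nested-dictionary layout and all names are hypothetical, and it assumes the ≥ 75% coverage rule is applied per vOTU and metagenome, with a vOTU dropped if it is detected nowhere.

```python
from typing import Dict

MIN_LENGTH_BP = 10_000           # community guideline for vOTU length
MIN_COVERED_FRACTION = 0.75      # fraction of the genome covered >= 1x by reads
EXCLUDED_QUALITY = "not-determined"  # CheckV tier removed to be conservative

Table = Dict[str, Dict[str, float]]  # {vOTU: {metagenome: value}}


def filter_votus(abundance: Table, covered: Table,
                 length_bp: Dict[str, int],
                 quality: Dict[str, str]) -> Table:
    """Keep only abundances (tpmean) that meet length, quality, and detection rules."""
    kept: Table = {}
    for votu, per_sample in abundance.items():
        # Sequence-level filters: minimum length and CheckV quality tier.
        if length_bp[votu] < MIN_LENGTH_BP:
            continue
        if quality[votu].lower() == EXCLUDED_QUALITY:
            continue
        # Detection filter: keep a metagenome only if enough of the
        # vOTU is covered by mapped reads in that metagenome.
        detected = {sample: value for sample, value in per_sample.items()
                    if covered[votu][sample] >= MIN_COVERED_FRACTION}
        if detected:
            kept[votu] = detected
    return kept
```

The 90% read identity and 95% clustering similarity thresholds are applied upstream, at the mapping and clustering steps, and so do not appear here.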
To compare viral diversity between metagenomes (i.e., beta diversity), we calculated the Hellinger distance, the Euclidean distance of Hellinger transformed abundance data. We performed Hellinger transformations using the transform function in the microbiome R package, calculated the Hellinger distance using the ordinate function in phyloseq, and then visualized these distances using principal-coordinate analysis (PCoA). We performed permutational multivariate analyses of variance (PERMANOVAs) with 9,999 permutations to test for significant differences in mean centroids using the model: Distance ~ Site + Rock type. Models were tested with "by = margins" and "by = terms" with all sequential combinations. We ran the ordistep and ordiR2step functions to help assess optimal parameters to include in the model. Since PERMANOVA tests are sensitive to differences in group dispersion, we also tested for significant differences in mean dispersions using the betadisper and permutest functions from the vegan package in R with 9,999 permutations.
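The Hellinger distance used above is the Euclidean distance between samples after each abundance is divided by the sample total and square-rooted. A minimal sketch in plain Python follows; it stands in for the microbiome and phyloseq R functions named in the text rather than reproducing them, and the function names are illustrative.

```python
import math
from typing import List, Sequence


def hellinger_transform(abundances: Sequence[float]) -> List[float]:
    """Square root of relative abundance for one sample."""
    total = sum(abundances)
    if total == 0:
        # An empty sample has no defined proportions; return zeros.
        return [0.0 for _ in abundances]
    return [math.sqrt(value / total) for value in abundances]


def hellinger_distance(sample_a: Sequence[float],
                       sample_b: Sequence[float]) -> float:
    """Euclidean distance between two Hellinger-transformed samples.

    Both samples must list abundances for the same vOTUs in the same order.
    """
    if len(sample_a) != len(sample_b):
        raise ValueError("samples must cover the same vOTUs")
    ta = hellinger_transform(sample_a)
    tb = hellinger_transform(sample_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ta, tb)))
```

Two samples with identical proportions are at distance 0, and two samples that share no vOTUs are at the maximum distance of √2, which makes the measure insensitive to differences in total abundance between metagenomes.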
To test for correlations between viral community distances (Hellinger distances) and geographic distances, we first subset the data to exclude metagenomes from the Antarctic Peninsula, and to account for variation between rock types, subset the data to include only metagenomes representing sandstone samples. We calculated geographical distances between metagenomes using the distm function in the geosphere package in R. We performed Mantel tests in the vegan R package to assess correlations between the community and geographic distances using a Spearman correlation and 9999 permutations. Mantel tests were repeated with exclusion of community distances when the geographic distance was zero to assess if patterns persisted in the absence of data from the same site.
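The logic of a Mantel test with a Spearman correlation can be sketched as follows. This is a plain-Python illustration and not the vegan implementation: it assumes both distance matrices are already computed, square, symmetric, and in the same sample order, and it reports a one-sided permutation p-value with the usual +1 correction. All function names are illustrative.

```python
import math
import random
from itertools import combinations
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; applied to ranks this gives Spearman's rho."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def pairwise(matrix: Matrix, order: Sequence[int]) -> List[float]:
    """Distances for every unordered sample pair, after relabeling by `order`."""
    return [matrix[order[i]][order[j]]
            for i, j in combinations(range(len(order)), 2)]


def mantel_spearman(community: Matrix, geographic: Matrix,
                    permutations: int = 9999, seed: int = 0) -> Tuple[float, float]:
    """Return (Spearman rho, one-sided permutation p-value)."""
    labels = list(range(len(community)))
    geo_ranks = average_ranks(pairwise(geographic, labels))
    observed = pearson(average_ranks(pairwise(community, labels)), geo_ranks)
    rng = random.Random(seed)
    at_least_as_large = 0
    for _ in range(permutations):
        shuffled = labels[:]
        rng.shuffle(shuffled)  # permute sample labels of one matrix only
        rho = pearson(average_ranks(pairwise(community, shuffled)), geo_ranks)
        if rho >= observed:
            at_least_as_large += 1
    return observed, (at_least_as_large + 1) / (permutations + 1)
```

Permuting sample labels, rather than individual distances, preserves the dependence among entries that share a sample, which is why the test is done on whole rows and columns.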

Results and discussion
Using VirSorter2 [30], we predicted 101,085 viral sequences. We clustered these at 95% average nucleotide identity into 76,984 vOTUs [37]; we further used VContact2 [39] with INPHARED [40] reference genomes to cluster phage vOTUs into 7598 VCs, which approximate genus-level groupings based on gene-sharing networks. To keep the analysis focused on the most robust catalog, we filtered this collection using community thresholds for length, detection, and quality (see "Methods" section) [31,53,54]. The final viral catalog represented 14,796 viral sequences (Table S2; 76 complete, 341 high-quality, 1539 medium-quality, 12,840 low-quality), including 2,695 prophages, which clustered into 11,806 vOTUs, of which 5743 phage vOTUs (7309 sequences) were successfully placed in 2286 VCs; the final catalog was predicted to be predominantly dsDNA phage, though 15.2% of vOTUs may represent eukaryotic viruses (i.e., nucleocytoplasmic large DNA viruses).
Our findings may indicate that Antarctic rock communities host highly diverse and novel phage populations, with only 1.8% (41 out of 2,286) of the VCs including reference sequences. The remaining 98.2% were unique VCs (i.e., did not include reference genomes) and could represent novel phage genera, greatly expanding the known diversity of viruses. Of the 41 VCs that did include reference genomes, the majority were assigned to the Caudoviricetes class (formerly Caudovirales order) of tailed double-stranded DNA bacteriophage (Figure S1). Many genomes have not yet been reclassified, leaving viral taxonomy in flux; under the new schema, most of the 41 VCs are unclassified [58]. The majority of unique VCs are represented by viral sequences from sandstone communities (Fig. 1A); sandstone is an optimum substratum, in terms of rock traits (e.g., porosity), for endolithic colonization [59], but is also the most represented substratum in this work.
Fig. 1 Antarctica is an underappreciated source of phage novelty. A Bar charts displaying the number of viral sequences placed in VCs colored by rock type (sandstone n = 141, granite n = 43, quartz n = 5, and basalt/dolerite n = 2) and divided by whether the VC is clustered with reference genomes. B Bar chart displaying host predictions colored by predicted host phylum. C Bar chart showing the number of predicted phage AMGs summarized by DRAM-v distilled metabolic categories
We further established host-virus linkages using NCBI BLAST against complete bacterial and archaeal genomes from RefSeq, and Antarctic endolithic bacterial and archaeal metagenome-assembled genomes (MAGs) (see "Methods" section) [44][45][46] to explore the potential effects of viruses on host fitness, such as host-cell reprogramming through AMGs [60]. While we were unable to predict hosts for the majority of viral sequences (only 23.94% had a host prediction), we observed that Proteobacteria, Actinobacteriota, and Chloroflexota were the most commonly predicted host phyla (Fig. 1B), which are thought to be core members of these communities [14,61,62]. Using predictions against the Antarctic MAGs, we predicted hosts for an additional 16.5% of viral sequences compared to the 7.48% predicted using RefSeq alone (Figure S2).
Fig. 2 [...] Sequences that were clustered into VCs with reference data are labeled by their taxonomy, sequences clustered without reference genomes are labeled "Unique VC", while the rest are labeled based on their VContact2 status (i.e., singleton [share few or no genes with other genomes], overlap [share genes with genomes in multiple VCs], or outlier [share genes, but cannot confidently be placed in a VC]). C Principal-coordinate analysis (PCoA) visualization of Hellinger distances of viral communities. Samples are colored by site, with sites ordered by latitude, and have shapes based on geographic areas. D A scatter plot depicting a significant relationship between sandstone viral community beta diversity (Hellinger distance) and geographical distance (km) between sites
We then sought to improve understanding of the functional profiles of retrieved phages using DRAM-v (Table S3) [32]. Notably, this catalog, which comprises metabolic novelty (39.3% of DRAM-v predicted AMGs had no distilled classification), may complement other available resources, which have largely been limited to coverage of human-related microbiomes (e.g., Li et al. [63]).
Within identified functions, we found putative phage AMGs related to carbon, energy, and nitrogen metabolism (Fig. 1C). Specifically, within carbohydrate metabolism, glycoside hydrolases, glycosyltransferases, and carbohydrate-binding domains predominated. Within nitrogen metabolism, methionine degradation was the most prevalent module, and within energy, the dominant modules were related to electron transport and photosynthesis. This highlights the need to connect vOTUs to Antarctic MAGs [14] and to implement complementary techniques (e.g., single-cell genomics) to provide a deeper understanding of virus-bacteria dynamics. More importantly, these findings suggest a possible complex role for viruses in elemental biogeochemical cycles in the rocks of Antarctica, which have traditionally been considered devoid of life.
Given the geographic spread of sampling (see "Methods" section and Table S1; Fig. 2A), we assessed whether this catalog could be useful to answer ecological questions related to viral community dynamics. While the dominant vOTUs at each site were taxonomically unclassified and largely members of unique VCs and thus possible novel genera (Fig. 2B), when investigating between-sample diversity (beta diversity) we observed a significant pattern related to site specificity (Fig. 2C; PERMANOVA, p < 0.001). Further, we detected a significant correlation with geographic distance in sandstone communities (Fig. 2D; Mantel r = 0.197, p < 0.001), such that communities are more dissimilar with increasing distance. Combined, these results indicate possible latitudinal spatial structuring of viral communities. In further support of this, we were able to detect only 41.0% of vOTUs at more than one site, with 29.4% of vOTUs detected across two or more geographic regions and only 1.45% detected across all regions. Of the vOTUs detected across all regions, the majority were in unique VCs (66.7%) and none were in VCs with reference data. We hypothesize that this viral spatial structuring reflects the reported dispersal limitation and local composition and adaptation of hosts in these communities [14,15]. Similar spatial structuring has also been observed in grassland soil viromes, purportedly as a result of local assembly dynamics [41,64].

Conclusions
This study represents the most exhaustive geographic endeavor to date to capture the viral genomic diversity across ice-free regions of Antarctica and the first large-scale effort to explore the virosphere in endolithic communities. This catalog is a comprehensive repository for exploring the diversity, function, spatial ecology, and host-virus dynamics of this enigmatic continent. We also unveiled a potential influence of some viruses on carbon, energy, and nitrogen metabolism under conditions of oligotrophy up to the limit for life sustainability. Finally, this work may serve in the future as an important first step towards exploring adaptability of microbial communities in extreme conditions on Earth.
Additional file 1: Table S1. Sample information and viral identification statistics from Antarctic metagenomes. Table S2. Final viral catalog sequence information. Table S3. DRAM-v distilled annotations for predicted phage AMGs. Figure S1. Taxonomic classification of viral clusters (VCs) that include reference genomes. Figure S2. Comparison of host predictions between MAG and RefSeq databases.